Chemical Species Ontology for Data Integration and Knowledge Discovery

Web ontologies are important tools in modern scientific research because they provide a standardized way to represent and manage web-scale amounts of complex data. In chemistry, a semantic database for chemical species is indispensable for its ability to interrelate and infer relationships, enabling a more precise analysis and prediction of chemical behavior. This paper presents OntoSpecies, a web ontology designed to represent chemical species and their properties. The ontology serves as a core component of the chemistry domain of The World Avatar knowledge graph and includes a wide range of identifiers, chemical and physical properties, chemical classifications and applications, and spectral information associated with each species. The ontology also includes provenance and attribution metadata, ensuring the reliability and traceability of data. Most of the information about chemical species is sourced from PubChem and ChEBI data on the respective compound web pages using a software agent, making OntoSpecies a comprehensive semantic database of chemical species able to solve novel types of problems in the field. Access to this reliable source of chemical data is provided through a SPARQL endpoint. The paper presents example use cases to demonstrate the contribution of OntoSpecies in solving complex tasks that require integrated, semantically searchable chemical data. The approach presented in this paper represents a significant advancement in the field of chemical data management, offering a powerful tool for representing, navigating, and analyzing chemical information to support scientific research.


Classes in OntoSpecies
A list of all classes in OntoSpecies and their descriptions is reported in Table S1 in alphabetical order. The definitions of the namespace prefixes used in Table S1 are summarised in Table 2 in the main text. The table also includes the parent class and the corresponding CHEMINF or CHMO equivalent class when applicable. A class is linked to the parent class and equivalent class through the predicates rdfs:subClassOf and owl:equivalentClass, respectively.

Table S1: List of classes in OntoSpecies and their description. Parent class and equivalent class are also reported when applicable.

os:11BNMRSpectra
Boron-11 NMR spectroscopy (also known as 11B NMR) is a version of NMR spectroscopy used to elucidate the structure of boron-containing compounds.
• Subclass of os:1DNMRSpectra • Equivalent to CHMO:CHMO_0000843

os:13CNMRSpectra
Carbon-13 NMR spectroscopy (also known as 13C NMR) is a version of NMR spectroscopy used to elucidate the structure of carbon-containing compounds.
• Subclass of os:1DNMRSpectra • Equivalent to CHMO:CHMO_0000837

os:15NNMRSpectra
Nitrogen-15 NMR spectroscopy (also known as 15N NMR) is a version of NMR spectroscopy used to elucidate the structure of nitrogen-containing compounds.
• Subclass of os:1DNMRSpectra • Equivalent to CHMO:CHMO_0000844

os:17ONMRSpectra
Oxygen-17 NMR spectroscopy (also known as 17O NMR) is a version of NMR spectroscopy used to elucidate the structure of oxygen-containing compounds.

os:19FNMRSpectra
Fluorine-19 NMR spectroscopy (also known as 19F NMR) is a version of NMR spectroscopy used to elucidate the structure of fluorine-containing compounds.
• Subclass of os:1DNMRSpectra • Equivalent to CHMO:CHMO_0000845

os:29SiNMRSpectra
Silicon-29 NMR spectroscopy (also known as 29Si NMR) is a version of NMR spectroscopy used to elucidate the structure of silicon-containing compounds.
• Subclass of os:1DNMRSpectra • Equivalent to CHMO:CHMO_0001955

os:31PNMRSpectra
Phosphorus-31 NMR spectroscopy (also known as 31P NMR) is a version of NMR spectroscopy used to elucidate the structure of phosphorus-containing compounds.
• Subclass of os:2DNMRSpectra • Equivalent to CHMO:CHMO_0002420

os:1HNMRSpectra
Hydrogen-1 NMR spectroscopy (also known as H1 NMR or proton NMR) is a version of NMR spectroscopy used to elucidate the structure of hydrogen-containing compounds.
• Subclass of os:1DNMRSpectra • Equivalent to CHMO:CHMO_0002419

os:2DNMRSpectra
Two-dimensional NMR spectroscopy is a set of nuclear magnetic resonance spectroscopy (NMR) methods, which give data plotted in a space defined by two frequency axes.

os:AtomChiralCount
Atom stereocenter (atom that is related to four distinct atoms) count.
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000205

os:AtomChiralDefCount
Defined atom stereocenter (atom that is related to four distinct atoms) count.
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000206

os:AtomChiralUndefCount
Undefined atom stereocenter (atom that is related to four distinct atoms) count.

os:AtomicBond
Bond between two atoms.

os:AtomicRadius
Radius of an atom.
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000125

os:AtomicWeight
Mass of an atom.

os:AutoignitionTemperature
The lowest temperature at which the substance will spontaneously ignite in a normal atmosphere without an external source of ignition (e.g., spark or flame).
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000444

os:BoilingPoint
The temperature at which this compound changes state from liquid to gas at a given atmospheric pressure.
• Subclass of os:ThermoProperty

A set of concepts and categories in a subject area or domain that shows their properties and the relations between them.

os:CollisionCrossSection
Collision cross section represents the effective area for the interaction between an individual ion and the neutral gas through which it is traveling (e.g., in ion mobility spectrometry experiments). It quantifies the probability of a collision taking place between two or more particles.
• Subclass of os:Property

os:CompoundComplexity
Indicator that denotes how complicated a structure is.
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000390

os:CovalentUnitCount
The number of covalent units in a chemical structure.
• Subclass of os:Property

reversibly into smaller components, as when a complex falls apart into its component molecules, or when a salt splits up into its component ions. This includes pKa (the negative logarithm of the acid dissociation constant) and pKb (the negative logarithm of the base dissociation constant).
• Subclass of os:ThermoProperty

os:ElectronAffinity
Amount of energy released when an electron attaches to a neutral atom or molecule in the gaseous state to form an anion.
• Subclass of os:Property

os:ElectronConfiguration
Arrangement of electrons in orbitals around an atomic nucleus.
• Subclass of os:Property

os:Electronegativity
Electronegativity is an atomic quality that describes its power to attract electrons to itself.
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000121

pt:Element
An element in the periodic table.

os:ElementClassification
Classification of elements in the periodic table.
• Subclass of os:Classification

os:ElementGroupNumber
Group number of an element in the periodic table.
• Subclass of os:Property

os:ElementName
Name of an element in the periodic table.
• Subclass of os:Identifier

os:ElementPeriodNumber
Period number of an element in the periodic table.
• Subclass of os:Property

os:ElementSymbol
Symbol of an element in the periodic table.
• Subclass of os:Identifier

os:EnthalpyOfSublimation
The enthalpy (or heat) of sublimation is the amount of energy that must be added to a mole of solid at constant pressure to turn it directly into a gas (without passing through the liquid phase).
• Subclass of os:ThermoProperty

The lowest temperature at which a liquid can give off vapor to form an ignitable mixture in air near the surface of the liquid.
• Subclass of os:ThermoProperty • Equivalent to CHEMINF:CHEMINF_000417

os:Frequency
Frequency of the spectrometer.

os:FunctionalGroup
Specific groups of atoms within molecules that are responsible for the characteristic chemical reactions of those molecules.

GHS (Globally Harmonized System of Classification and Labelling of Chemicals) is a United Nations system to identify hazardous chemicals and to inform users about these hazards. GHS has been adopted by many countries around the world and is now also used as the basis for international and national transport regulations for dangerous goods.
• Subclass of os:Classification

os:GroundLevel
Ground level of an element in the periodic table.
• Subclass of os:Property

os:HeatOfCombustion
The heat of combustion is the energy released as heat when a compound undergoes complete combustion with oxygen under standard conditions.
• Subclass of os:ThermoProperty

os:HeatOfVaporization
The heat (or enthalpy) of vaporization is the quantity of heat that must be absorbed if a certain quantity of liquid is vaporized at a constant temperature.

os:HeavyAtomCount
The number of non-hydrogen atoms.
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000300

os:HenrysLawConstant
Henry's law states that the amount of dissolved gas (in liquid, such as water) is proportional to its partial pressure in the gas phase. The proportionality factor is called the Henry's law constant and defined as the ratio of a compound's partial pressure in air to the concentration of the compound in water at a given temperature.

os:IonizationMode
Ionization mode used in the mass spectrometry analysis.

os:IonizationPotential
Ionization potential, also called ionization energy, is the amount of energy required to remove an electron from an isolated atom or molecule.
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000191

os:IsoelectricPoint
The isoelectric point, sometimes abbreviated to IEP, is the pH at which a particular molecule or surface carries no net electrical charge.
• Subclass of os:Property

os:IsotopeAtomCount
The sum of all atoms enriched with respect to a particular atom isotope.
• Subclass of os:Property

The mass of a molecule calculated using the mass of the most abundant isotope of each element (e.g., carbon has a monoisotopic mass of 12.000 g/mol).
• Subclass of os:SpectralInformation • Equivalent to CHMO:CHMO_0000835

os:OpticalRotation
Optical rotation is a property of chiral substances that is expressed as the angle to which the material causes polarized light to rotate at a particular temperature, wavelength, and concentration.
• Subclass of os:MassSpectrometry

os:OxidationStates
Oxidation states of an element in the periodic table.
• Subclass of os:Property

os:Peak
A peak of the spectrum.

os:PolarSurfaceArea
The polar surface area is defined as the combined surface area belonging to oxygen and nitrogen atoms and hydrogen atoms bound to these electronegative atoms.
• Subclass of os:Property

okin:Reference
Provenance of data.

os:ReferenceState
Reference state of a thermodynamic property.

os:RotatableBondCount
A bond count that denotes the integer number of rotors in the molecule, generally single bonds, torsion around which produces non-identical geometric molecular configurations.

os:SMILES
A SMILES is a structure descriptor that denotes a molecular structure as a graph.
• Subclass of os:Identifier • Equivalent to CHEMINF:CHEMINF_000018

os:Solubility
The solubility of a substance is the amount of that substance that will dissolve in a given amount of solvent. The default solvent is water, if not indicated.
• Subclass of os:ThermoProperty • Equivalent to CHEMINF:CHEMINF_000258

os:Solvent
Solvent used in the NMR analysis.

os:Species
An ensemble of chemically identical molecular entities (ref. S1).

os:SpectraGraph
Spectral data collected in a graph.

os:StandardEnthalpyOfFormation
The energy required to form 1 mole of a substance from its constituent elements.
These fingerprints are used for similarity neighboring and similarity searching.
• Subclass of os:Property

os:SurfaceTension
Surface tension is a contractive tendency of the surface of a liquid that allows it to resist an external force. It is measured as the energy required to increase the surface area of a liquid by a unit of area.
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000202

os:ThermoProperty
Thermodynamic property of a chemical species.
• Subclass of os:Property

om:Unit
Unit of measurement.

os:Use
Application and role of a chemical species.

os:VaporDensity
The density of a gas or vapor relative to that of the reference gas. While some resources use hydrogen gas as the reference gas for the vapor density calculation, many resources (particularly in relation to safety considerations at commercial and industrial facilities in the U.S.) define the vapor density with respect to the density of air, which has an arbitrary value of one. If a gas has a vapor density of less than one, it will generally rise in air. If the vapor density is greater than one, the gas will generally sink in air.
• Subclass of os:ThermoProperty • Equivalent to CHEMINF:CHEMINF_000440

os:VaporPressure
Vapor pressure (or equilibrium vapor pressure) is the pressure of a vapor in thermodynamic equilibrium with its condensed phases in a closed system.

• Equivalent to CHEMINF:CHEMINF_000419

os:Viscosity
Viscosity is a measure of a fluid's resistance to flow. It describes the internal friction of a moving fluid.
• Subclass of os:ThermoProperty

os:XCoordinate
Atom X coordinate in space.

os:XLogP3
XLOGP3 is an atom-additive method that calculates log P by adding up contributions from each atom in the given molecule.
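The parent and equivalent-class links listed in Table S1 can be written out as RDF statements. The following is a minimal sketch that serializes one table row in Turtle syntax using prefixed names only. The helper name `class_to_turtle` is hypothetical, the `a owl:Class` typing is an assumption about how the classes are declared, and the namespace prefix declarations (defined in Table 2 of the main text) are deliberately omitted.

```python
from typing import Optional


def class_to_turtle(name: str,
                    parent: Optional[str] = None,
                    equivalent: Optional[str] = None) -> str:
    """Serialize one Table S1 row as a Turtle statement (prefixed names only)."""
    # Assumption: every class is typed as owl:Class.
    statements = ["a owl:Class"]
    if parent:
        statements.append(f"rdfs:subClassOf {parent}")
    if equivalent:
        statements.append(f"owl:equivalentClass {equivalent}")
    return f"{name} " + " ;\n    ".join(statements) + " ."


# Row taken from Table S1: os:Electronegativity
print(class_to_turtle("os:Electronegativity",
                      parent="os:Property",
                      equivalent="CHEMINF:CHEMINF_000121"))
```

Rows without an equivalent class, such as os:ElectronConfiguration, simply omit the `owl:equivalentClass` statement.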
• CRITERION 2: Exclude co-solvents with high potential health impact. To do so, we removed all the species that have GHS safety statements related to risk of cancer and risk for the unborn child (ref. S5).
• CRITERION 3: Exclude co-solvents with a boiling point higher than 423 K to reduce the cost of solvent recovery by distillation (ref. S4).
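Criteria 2 and 3 amount to two successive filters over the candidate list. The sketch below illustrates this logic outside SPARQL; it is not the query in Figure S2. The `Candidate` type, the `health_hazard` flag, and the three candidates are hypothetical placeholders, and the input is assumed to already satisfy criterion 1. Only the 423 K threshold comes from the text.

```python
from dataclasses import dataclass
from typing import List

MAX_BOILING_POINT_K = 423.0  # criterion 3 threshold quoted in the text


@dataclass
class Candidate:
    name: str               # hypothetical label, not a real species
    boiling_point_k: float  # boiling point in kelvin
    health_hazard: bool     # True if a criterion-2 GHS statement applies


def passes_criterion_2(c: Candidate) -> bool:
    """Keep species without cancer or unborn-child GHS statements."""
    return not c.health_hazard


def passes_criterion_3(c: Candidate) -> bool:
    """Keep species whose boiling point does not exceed 423 K."""
    return c.boiling_point_k <= MAX_BOILING_POINT_K


def screen(candidates: List[Candidate]) -> List[str]:
    """Return the names of candidates satisfying criteria 2 and 3."""
    return [c.name for c in candidates
            if passes_criterion_2(c) and passes_criterion_3(c)]


# Illustrative placeholders only; values are not taken from Table S2.
candidates = [
    Candidate("candidate-A", 350.0, False),
    Candidate("candidate-B", 350.0, True),   # rejected by criterion 2
    Candidate("candidate-C", 450.0, False),  # rejected by criterion 3
]
print(screen(candidates))  # ['candidate-A']
```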

USE CASE 2: List of Suitable Co-Solvents
Table S2 reports a list of co-solvents for propan-2-ol that enable the easiest separation by distillation, obtained by querying OntoSpecies. The first column reports the species selected by the SPARQL query in Figure S2 as suitable co-solvents (49 species). The second column reports a check-mark if the species satisfies criterion 1; the third column reports a check-mark if the species satisfies criteria 1 and 2 together; the fourth column reports a check-mark if the species satisfies criteria 1, 2 and 3 together. Species that satisfy criterion 1 but are discarded by criterion 2 are highlighted in red (4 species). Species that satisfy all the criteria are highlighted in green (13 species).
Table S2: List of co-solvents for propan-2-ol to enable the easiest separation by distillation

USE CASE 3: SPARQL Queries
The SPARQL query that selects all the possible liquid products of the electrochemical CO2 reduction (ref. S6) is shown in Figure S3. The query selects species with chemical formula CxHyOz, with x < 5 and z < 10. An additional filter on the boiling point (Tb) is added to remove all the species that are not expected to be found in the liquid phase at room temperature.
The selected threshold, Tb = 15 °C, is lower than the experimental temperature (T = 25 °C) to ensure that all species that might partially be in the liquid phase are included.
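The selection rule described above can be sketched in a few lines of Python. This illustrates the filter logic only and is not the SPARQL query of Figure S3. The function names are hypothetical, the requirement of at least one carbon atom is an assumption, and the boiling points in the example are approximate literature values that are not taken from OntoSpecies.

```python
import re

# One element symbol followed by an optional count, e.g. "C2", "H6", "O".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> dict:
    """Count atoms per element in a simple formula without brackets."""
    counts = {}
    for element, number in TOKEN.findall(formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def is_candidate(formula: str, boiling_point_c: float,
                 max_c: int = 5, max_o: int = 10,
                 min_tb_c: float = 15.0) -> bool:
    """Apply the CxHyOz filter (x < 5, z < 10) and the Tb > 15 degC filter."""
    counts = parse_formula(formula)
    if not set(counts) <= {"C", "H", "O"}:
        return False
    x = counts.get("C", 0)
    z = counts.get("O", 0)
    # Assumption: at least one carbon atom is required.
    return 1 <= x < max_c and z < max_o and boiling_point_c > min_tb_c


# Approximate literature boiling points in degC, for illustration only.
examples = {
    "ethanol": ("C2H6O", 78.4),
    "methanol": ("CH4O", 64.7),
    "methane": ("CH4", -161.5),
    "ethylene": ("C2H4", -103.7),
}
liquid = [name for name, (f, tb) in examples.items() if is_candidate(f, tb)]
print(liquid)  # ['ethanol', 'methanol']
```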
The IRI of each species is then used in a second query (Figure S4) to get the 1H NMR spectral information (peak shifts and intensities). A filter is used in the query to select only the first 1H NMR spectrum added to OntoSpecies from the PubChem record. This is done through the IRI indexing described in the "Assertion Component" section in the main text.

USE CASE 3: Subproducts Identification
In the NMR spectra in Figure 8 (main text), the triplet signals at 1.04 ppm do not exactly follow the expected 1:2:1 pattern (i.e., 25%:50%:25%). The calculated peak areas give 30%:42%:27%, which indicates that another signal might overlap on the left side of the triplet in Figure 8 (main text). Likewise, the signals of the quartet at 3.51 ppm do not exactly follow the expected 1:3:3:1 pattern (i.e., 13%:37%:37%:13%). The calculated peak areas give 12%:33%:38%:17%, which indicates that another signal might overlap on the right side of the quartet in Figure 8 (main text). A list of possible subproducts can then be found by querying for species whose highest peak can interfere with these two peaks (P3 and P6). The SPARQL query result is reported in Table S3. Check-marks in columns P3 and P6 indicate that the species interferes with P3 and P6, respectively. Species that interfere with both peaks are highlighted in blue.

Table S3: List of possible subproducts for the NMR spectra in Figure 8 (main text).
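The expected multiplet patterns quoted above follow from binomial coefficients, and the side on which an extra signal overlaps can be read from the sign of the deviations. The sketch below reproduces this arithmetic with the peak areas reported in the text. The function names are hypothetical, and the areas are assumed to be listed from left to right as they appear in the spectrum. The exact quartet percentages are 12.5% and 37.5%, which the text rounds to 13% and 37%.

```python
from math import comb
from typing import List


def expected_pattern(n_neighbors: int) -> List[float]:
    """Ideal first-order multiplet areas (in %) for n equivalent neighbors."""
    coeffs = [comb(n_neighbors, k) for k in range(n_neighbors + 1)]
    total = sum(coeffs)
    return [100.0 * c / total for c in coeffs]


def deviations(measured: List[float], n_neighbors: int) -> List[float]:
    """Measured minus ideal area for each line of the multiplet."""
    return [m - e for m, e in zip(measured, expected_pattern(n_neighbors))]


def largest_excess(measured: List[float], n_neighbors: int) -> int:
    """Index of the line with the largest positive deviation."""
    dev = deviations(measured, n_neighbors)
    return dev.index(max(dev))


triplet = [30.0, 42.0, 27.0]        # areas at 1.04 ppm, from the text
quartet = [12.0, 33.0, 38.0, 17.0]  # areas at 3.51 ppm, from the text

print(expected_pattern(2))          # [25.0, 50.0, 25.0]
print(expected_pattern(3))          # [12.5, 37.5, 37.5, 12.5]
print(largest_excess(triplet, 2))   # 0 -> leftmost line of the triplet
print(largest_excess(quartet, 3))   # 3 -> rightmost line of the quartet
```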

Data Enrichment
CASE 1: List of Alkenes Boiling Points

os:ExactMass
Mass of the most intense molecule peak in an MS spec, and when calculated denotes the mass of a molecule containing most likely isotopic composition for a single ran-

is a systematic name which is formulated according to the rules and recommendations for chemical nomenclature set out by the International Union of Pure and Applied Chemistry (IUPAC).
• Subclass of os:Identifier • Equivalent to CHEMINF:CHEMINF_000107

os:LCMS
Data from liquid chromatography-mass spectrometry (LC-MS) experiments.
• Subclass of os:MassSpectrometry • Equivalent to CHMO:CHMO_0000524

os:LogP
Log P is the partition coefficient expressed in logarithmic form. The partition coefficient is the ratio of concentrations of a compound in a mixture of two immiscible solvents at equilibrium. This ratio is therefore used to compare the solubilities of the solute in these two solvents. Because octanol and water are the most commonly used pair of solvents for measuring partition coefficients, the Log P values listed in this section refer to octanol/water partition coefficients, unless indicated otherwise.
• Subclass of os:Property • Equivalent to CHEMINF:CHEMINF_000251

os:LogS
The base-10 logarithm of the aqueous solubility of this compound.
• Subclass of os:Property

os:MALDI
MALDI (matrix-assisted laser desorption/ionization) is an ionization technique that uses a laser energy absorbing matrix to create ions from large molecules with min-

Mass spectrometry (MS or mass spec) is a technique to determine molecular structure through ionization and fragmentation of the parent compound into smaller components.
• Subclass of os:SpectralInformation • Equivalent to CHMO:CHMO_0000470

os:MeltingPoint
The melting point is the temperature at which a substance changes state from solid to liquid at atmospheric pressure. When considered as the temperature of the reverse change (from liquid to solid), it is referred to as the freezing point.
• Subclass of os:ThermoProperty • Equivalent to CHEMINF:CHEMINF_000256

os:MolecularFormula
A molecular formula is a structure descriptor which identifies each constituent element by its chemical symbol and indicates the number of atoms of each element found in each discrete molecule of that compound.

Figure S4: SPARQL query selecting 1H NMR spectral peak information (chemical shift and intensity) of a specific species. #IRI# in the shown query text is replaced with the species IRI.
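The placeholder substitution described in the caption can be sketched as a plain string replacement. The template below is a generic stand-in and not the query of Figure S4; the function name `fill_query`, the angle brackets around the placeholder, and the example.org IRI are all hypothetical.

```python
# Generic stand-in template; the actual query text is shown in Figure S4.
QUERY_TEMPLATE = """SELECT ?predicate ?object
WHERE {
  <#IRI#> ?predicate ?object .
}"""


def fill_query(template: str, species_iri: str) -> str:
    """Replace the #IRI# placeholder with a species IRI."""
    # Reject characters that would break out of the <...> IRI delimiters.
    if any(ch in species_iri for ch in '<>" \n'):
        raise ValueError(f"not a usable IRI: {species_iri!r}")
    return template.replace("#IRI#", species_iri)


# Hypothetical IRI, used only to show the substitution.
query = fill_query(QUERY_TEMPLATE, "http://example.org/species/Species_0001")
print(query)
```

Running the first query once and then calling `fill_query` for each returned IRI yields one spectral query per species.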

Table S4: List of species classified as alkenes in OntoSpecies ordered by number of carbon atoms, their experimental boiling points (Tb-experimental) taken from PubChem or GuideChem (black or blue color, respectively) and their extrapolated boiling points (Tb-predicted) obtained by fitting PubChem data with a cube root function.
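A cube root fit of this kind can be sketched with ordinary least squares. The functional form Tb = a·n^(1/3) + b, with n the number of carbon atoms, is an assumption, since the caption only states that a cube root function was used. The function names are hypothetical and the data points are synthetic, generated from known coefficients so that the fit can be checked; they are not values from Table S4.

```python
from typing import List, Tuple


def fit_cube_root(carbon_counts: List[int],
                  boiling_points: List[float]) -> Tuple[float, float]:
    """Least-squares fit of Tb = a * n**(1/3) + b (assumed functional form)."""
    xs = [n ** (1.0 / 3.0) for n in carbon_counts]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(boiling_points) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, boiling_points))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict(a: float, b: float, carbon_count: int) -> float:
    """Extrapolate the boiling point for a given number of carbon atoms."""
    return a * carbon_count ** (1.0 / 3.0) + b


# Synthetic data generated from a = 300, b = -200 (not real boiling points).
counts = [2, 3, 4, 5, 6]
synthetic_tb = [300.0 * n ** (1.0 / 3.0) - 200.0 for n in counts]

a, b = fit_cube_root(counts, synthetic_tb)
print(round(a, 6), round(b, 6))
print(round(predict(a, b, 8), 6))  # extrapolation to n = 8
```

With real data, the experimental values would replace `synthetic_tb`, and the fitted coefficients would then be used to fill in missing boiling points.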